[SPARK-20114][ML] spark.ml parity for sequential pattern mining - PrefixSpan#20973
[SPARK-20114][ML] spark.ml parity for sequential pattern mining - PrefixSpan#20973WeichenXu123 wants to merge 4 commits intoapache:masterfrom
Conversation
|
Test build #88873 has finished for PR 20973 at commit
|
|
Test build #88885 has finished for PR 20973 at commit
|
jkbradley
left a comment
There was a problem hiding this comment.
Thanks for the PR!
For the tests, can you please structure them as follows?
- If a test dataset is only used in one test, it's fine to put it in the test() call itself, rather than in a shared location.
- If you copied a test from spark.mllib, please copy other info for that test, such as R code to reproduce the expected results.
Thanks!
There was a problem hiding this comment.
We never want to use default arguments in Scala APIs since they are not Java-friendly. Let's just state recommended values in the docstrings. We can add defaults when we create an Estimator.
There was a problem hiding this comment.
Let's fix this phrasing by just saying "the maximal length of the sequential pattern" (The other part does not make sense: "any pattern that appears...") Feel free to fix that in the old API doc too.
There was a problem hiding this comment.
Be very explicit about the output schema please: For each column, provide the name and DataType.
There was a problem hiding this comment.
rename: findFrequentSequentPatterns -> findFrequentSequentialPatterns
There was a problem hiding this comment.
We don't really need this handlePersistence logic here since it's handled by the spark.mllib implementation.
There was a problem hiding this comment.
Let's check the input schema and throw a clear exception if it's not OK.
There was a problem hiding this comment.
It'd be nice to document that rows with nulls in this column are ignored.
There was a problem hiding this comment.
You could add a unit test for that too.
There was a problem hiding this comment.
nit: I'd prefer to call the column "frequency"
|
Test build #4158 has finished for PR 20973 at commit
|
dc7d779 to
acbf9e4
Compare
|
Test build #89836 has finished for PR 20973 at commit
|
|
Test build #89837 has finished for PR 20973 at commit
|
jkbradley
left a comment
There was a problem hiding this comment.
Thanks for the updates! They look good. I just noticed one small mistake I made in the original review.
| * @return A `DataFrame` that contains columns of sequence and corresponding frequency. | ||
| * The schema of it will be: | ||
| * - `sequence: Seq[Seq[T]]` (T is the item type) | ||
| * - `frequency: Long` |
There was a problem hiding this comment.
I had asked for this change to "frequency" from "freq," but I belatedly realized that this conflicts with the existing FPGrowth API, which uses "freq." It would be best to maintain consistency. Would you mind reverting to "freq?"
|
LGTM pending jenkins tests |
|
Test build #4162 has finished for PR 20973 at commit
|
|
Rerunning tests in case the R CRAN failure was from flakiness |
|
Test build #4165 has finished for PR 20973 at commit
|
|
Jenkins, test this please. |
|
Test build #90040 has finished for PR 20973 at commit
|
|
Jenkins, test this please. |
|
Test build #90116 has finished for PR 20973 at commit
|
|
Merging with master |
| @Since("2.4.0") | ||
| def findFrequentSequentialPatterns( | ||
| dataset: Dataset[_], | ||
| sequenceCol: String, |
There was a problem hiding this comment.
@WeichenXu123 @jkbradley The static method doesn't scale with parameters. If we add a new param, we have to keep the old one for binary compatibility. Why not using setters? I think we only need to avoid using fit and transform names.
There was a problem hiding this comment.
I agree with using setters. @jkbradley What do you think of it ?
There was a problem hiding this comment.
I agree in general, but I don’t think it’s a big deal for PrefixSpan. I think of our current static method as a temporary workaround until we do the work to build a Model which can make meaningful predictions. This will mean that further PrefixSpan improvements may be blocked on this Model work, but I think that’s OK since predictions should be the next priority for PrefixSpan. Once we have a Model, I recommend we deprecate the current static method.
I'm also OK with changing this to use setters, but then we should name it something else so that we can replace it with an Estimator + Model pair later on. I'd suggest "PrefixSpanBuilder."
There was a problem hiding this comment.
It should be easier to keep the PrefixSpan name and make it an Estimator later. For example:
final class PrefixSpan(override val uid: String) extends Params {
// param, setters, getters
def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame
}Later we can add Estimator.fit and PrefixSpanModel.transform. Any issue with this approach?
There was a problem hiding this comment.
this way final class PrefixSpan(override val uid: String) extends Params seemingly breaks binary compatibility if later we change it into an estimator ?
There was a problem hiding this comment.
Adding extends Estimator later should only introduce new methods to the class but no breaking changes.
There was a problem hiding this comment.
Oh, I think you're right @mengxr . That approach sounds good.
There was a problem hiding this comment.
@WeichenXu123 Do you have time to send a PR to update this API?
There was a problem hiding this comment.
Sure. Will update soon!
What changes were proposed in this pull request?
PrefixSpan API for spark.ml. New implementation instead of #20810
How was this patch tested?
TestSuite added.